ABSTRACT

We consider the design of decision problems which maximize the classification error for a given set of discriminants.

A minimax principle is proved, which has applications in discriminant analysis and feature extraction.


SOME  GENERAL  PRINCIPLES  FOR  THE  DUAL 
PROBLEM  TO  STATISTICAL  CLASSIFICATION 


I.  Introduction 

In the "two class" problem of statistical classification, we are given two random variables, $X_1$ and $X_2$, taking values in $R^d$. We assume occurrences of type 1 ($X_1$) and of type 2 ($X_2$) are mutually exclusive and have prior probabilities $\alpha$ and $1-\alpha$ respectively ($0 < \alpha < 1$). If x is observed, we need to decide whether x is of type 1 or type 2 in such a fashion as to minimize the probability of making an incorrect decision.

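If the probability densities (with respect to some underlying $\sigma$-finite measure $\nu$ on $R^d$) of $X_1$ and $X_2$, $p_1(y)$ and $p_2(y)$, were known, we could decide by using the likelihood ratio test

$$\frac{(1-\alpha)\,p_2(x)}{\alpha\,p_1(x)} \;\begin{cases} > 1 & \text{type 2} \\ < 1 & \text{type 1.} \end{cases}$$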
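The likelihood ratio test can be illustrated with a minimal sketch, assuming two one-dimensional normal densities. The densities `p1` and `p2`, the function names, and the integration grid are all hypothetical choices made for illustration; they are not taken from the paper.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def classify(x, alpha, p1, p2):
    """Likelihood ratio test: type 2 iff (1 - alpha) * p2(x) > alpha * p1(x)."""
    return 2 if (1.0 - alpha) * p2(x) > alpha * p1(x) else 1


def bayes_error(alpha, p1, p2, lo=-20.0, hi=20.0, n=40001):
    """Trapezoid-rule integral of min(alpha * p1, (1 - alpha) * p2)."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        x = lo + k * h
        weight = 0.5 if k in (0, n - 1) else 1.0
        total += weight * min(alpha * p1(x), (1.0 - alpha) * p2(x))
    return total * h


def p1(x):
    return normal_pdf(x, 0.0, 1.0)  # hypothetical type 1 density


def p2(x):
    return normal_pdf(x, 2.0, 1.0)  # hypothetical type 2 density
```

With equal priors the test reduces to comparing the two densities, and the resulting minimum error is the integral of the smaller of the two weighted densities.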

Unfortunately, these probability densities are often unknown and the problem becomes one of either density estimation or choosing a discriminant function from a class of "feasible" discriminants. There is extensive literature on this subject and we refer the reader to [1], [2], [3].

We define the dual of the two class classification problem as follows: We are given a set A of pairs of density functions for the random variables $X_1$ and $X_2$. For which pair $(p_1,p_2) \in A$ is the classification error maximum among all pairs in A? For this problem, the structure of A is critical. The main difficulty is to transform an applied problem into a mathematical expression for A. We consider several examples of dual problems which occur in signal design.

Example 1. The Mixture Problem

Given a density function $p(x)$ for X, and given density functions $q_1, q_2, \dots, q_n$ for $Y_1, Y_2, \dots, Y_n$, find an independent chance device N (Bernoulli r.v.) taking values in $\{1,2,\dots,n\}$ such that the error for the classification problem X vs. $Y_N$ is maximum. Here

$$A = \Big\{\Big(p(x),\ \sum_{i=1}^{n} a_i q_i(x)\Big) : a_i \ge 0,\ \sum_{i=1}^{n} a_i = 1\Big\}.$$
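A minimal sketch of the mixture problem for two components follows, assuming one-dimensional normal densities and a coarse grid search over the weight. The densities, the grid sizes, and every function name are hypothetical and serve only to make the set A concrete.

```python
import math


def normal_pdf(x, mean, std=1.0):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_error(weights, alpha, p, qs, lo=-12.0, hi=12.0, n=2401):
    """Minimum total error for p versus the mixture sum_i weights[i] * qs[i]."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        x = lo + k * h
        mix = sum(w * q(x) for w, q in zip(weights, qs))
        edge = 0.5 if k in (0, n - 1) else 1.0
        total += edge * min(alpha * p(x), (1.0 - alpha) * mix)
    return total * h


def worst_case_weight(alpha, p, q1, q2, steps=20):
    """Grid search over a in [0, 1] for the mixture a * q1 + (1 - a) * q2."""
    best_a, best_err = 0.0, -1.0
    for k in range(steps + 1):
        a = k / steps
        err = mixture_error([a, 1.0 - a], alpha, p, [q1, q2])
        if err > best_err:
            best_a, best_err = a, err
    return best_a, best_err


def p(x):
    return normal_pdf(x, 0.0)   # hypothetical density of X


def q1(x):
    return normal_pdf(x, -2.0)  # hypothetical density of Y_1


def q2(x):
    return normal_pdf(x, 2.0)   # hypothetical density of Y_2
```

In this symmetric illustration the error-maximizing chance device weights the two components equally, which is what the grid search is expected to recover.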

Example 2. The Masking Problem

Let X be a discrete stationary signal of length d. Design a stationary stochastic (independent) signal M of length d such that

(a) the d.c. component $= |E M_i| \le K_1$,

(b) the a.c. power $= \mathrm{Var}(M_i) \le K_2$,

and (c) the error is maximized for the problem

X+M vs. M.

This problem was discussed in [4], where X and M were restricted to be multivariate normal.


Example 3. A Code Jamming Problem

Let X and Y be discrete real stationary stochastic signals of length d. Find a stationary (independent) signal J such that the error for the problem X+J vs. Y is maximum.

In Section IV we shall treat the first example in detail by applying some general principles developed in Sections II and III. We conclude in Section V by considering some theoretical implications for the problem of feature selection.



II.  Discriminant  Functions  and  a  Minimax  Principle 

A discriminant function is a map $L: R^d \to R$. Its error is given by

$$\inf_{-\infty < t < +\infty} \big[\alpha\,\mathrm{Prob}_1\{L(X_1) > t\} + (1-\alpha)\,\mathrm{Prob}_2\{L(X_2) < t\}\big]$$

for the minimum total error problem. For a given L and a pair $(p_1,p_2)$, denote the above error by $\mathcal{E}_\alpha(L;(p_1,p_2))$.
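The error of a discriminant can be estimated from finite samples of discriminant values. The following is a minimal sketch, assuming the values $L(X_1)$ and $L(X_2)$ are available as plain lists; the function name and the sample values in the usage note are hypothetical.

```python
def discriminant_error(scores1, scores2, alpha):
    """Empirical inf over t of alpha * P1(L > t) + (1 - alpha) * P2(L < t).

    The objective is piecewise constant in t and changes only at observed
    values, so it suffices to try the observed values themselves, the
    midpoints between neighbours, and one point beyond each end.
    """
    values = sorted(set(scores1) | set(scores2))
    thresholds = [values[0] - 1.0, values[-1] + 1.0] + values
    thresholds += [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    best = float("inf")
    for t in thresholds:
        err1 = sum(s > t for s in scores1) / len(scores1)  # type 1 called type 2
        err2 = sum(s < t for s in scores2) / len(scores2)  # type 2 called type 1
        best = min(best, alpha * err1 + (1.0 - alpha) * err2)
    return best
```

For example, with type 1 values `[0, 1, 2, 3]`, type 2 values `[2.5, 3.5, 4.5, 5.5]` and equal priors, one value from one class must be misclassified at any good threshold, giving an error of 0.125.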

Let $\mathcal{L}$ be a class of discriminant functions. If for a particular dual problem $p_2/p_1 \in \mathcal{L}$ for all $(p_1,p_2) \in A$ (or some other optimal discriminant, i.e., $\log p_2 - \log p_1$, $p_2 - p_1$, ...), then we are interested in the quantity

$$\max_{(p_1,p_2)}\ \min_{L}\ \mathcal{E}_\alpha(L;(p_1,p_2))$$

if it exists, and in a pair $(\hat p_1,\hat p_2) \in A$ such that

$$\mathcal{E}_\alpha\big(\hat p_2/\hat p_1;(\hat p_1,\hat p_2)\big) = \min_{L}\mathcal{E}_\alpha\big(L;(\hat p_1,\hat p_2)\big) = \max_{(p_1,p_2)}\mathcal{E}_\alpha\big(p_2/p_1;(p_1,p_2)\big) = \max_{(p_1,p_2)}\ \min_{L}\ \mathcal{E}_\alpha\big(L;(p_1,p_2)\big).$$


The first and third equalities follow after a trivial application of Bayes' Theorem. Even if $p_2/p_1 \notin \mathcal{L}$ for all $(p_1,p_2) \in A$ (and no other optimal discriminant belongs to $\mathcal{L}$), expressions of the form

$$\max_{(p_1,p_2)}\ \min_{L}\ \mathcal{E}_\alpha(L;(p_1,p_2))$$

are appropriate for problems in which only $L \in \mathcal{L}$ are used in the classification problem $X_1$ vs. $X_2$. Hence, we continue our discussion for $(p_1,p_2) \in A$ and $L \in \mathcal{L}$. First we show that $\mathcal{E}_\alpha(L;(p_1,p_2))$ is concave in $(p_1,p_2)$.


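Lemma 1. Assume A is convex. Then

$$\mathcal{E}_\alpha\big(L;(\gamma p_1 + (1-\gamma)\tilde p_1,\ \gamma p_2 + (1-\gamma)\tilde p_2)\big) \ \ge\ \gamma\,\mathcal{E}_\alpha\big(L;(p_1,p_2)\big) + (1-\gamma)\,\mathcal{E}_\alpha\big(L;(\tilde p_1,\tilde p_2)\big)$$

for all $0 \le \gamma \le 1$ and $(p_1,p_2), (\tilde p_1,\tilde p_2) \in A$.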


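Proof. Let $(q_1,q_2) = \gamma(p_1,p_2) + (1-\gamma)(\tilde p_1,\tilde p_2)$. Then

$$\begin{aligned}
\alpha\,\mathrm{Prob}_{q_1}\{L>t\} &+ (1-\alpha)\,\mathrm{Prob}_{q_2}\{L<t\} \\
&= \alpha\big(\gamma\,\mathrm{Prob}_{p_1}\{L>t\} + (1-\gamma)\,\mathrm{Prob}_{\tilde p_1}\{L>t\}\big) \\
&\quad + (1-\alpha)\big(\gamma\,\mathrm{Prob}_{p_2}\{L<t\} + (1-\gamma)\,\mathrm{Prob}_{\tilde p_2}\{L<t\}\big) \\
&= \gamma\big(\alpha\,\mathrm{Prob}_{p_1}\{L>t\} + (1-\alpha)\,\mathrm{Prob}_{p_2}\{L<t\}\big) \\
&\quad + (1-\gamma)\big(\alpha\,\mathrm{Prob}_{\tilde p_1}\{L>t\} + (1-\alpha)\,\mathrm{Prob}_{\tilde p_2}\{L<t\}\big) \\
&\ge \gamma\,\mathcal{E}_\alpha\big(L;(p_1,p_2)\big) + (1-\gamma)\,\mathcal{E}_\alpha\big(L;(\tilde p_1,\tilde p_2)\big).
\end{aligned}$$

The result follows by taking the infimum of the above inequality over all t.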
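The concavity stated in Lemma 1 can be checked numerically on a finite sample space. This is a minimal sketch, assuming distributions are given as probability vectors over a common list of discriminant values; the function names and the example pairs in the usage note are hypothetical.

```python
def total_error(scores, p1, p2, alpha):
    """inf over t of alpha * P1(L > t) + (1 - alpha) * P2(L < t) on a finite set.

    scores[k] is the value of L at the k-th point; p1[k], p2[k] are its
    probabilities under the two hypotheses.
    """
    values = sorted(set(scores))
    thresholds = [values[0] - 1.0, values[-1] + 1.0] + values
    thresholds += [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    best = float("inf")
    for t in thresholds:
        err1 = sum(w for s, w in zip(scores, p1) if s > t)
        err2 = sum(w for s, w in zip(scores, p2) if s < t)
        best = min(best, alpha * err1 + (1.0 - alpha) * err2)
    return best


def mix(gamma, p, p_tilde):
    """Convex combination gamma * p + (1 - gamma) * p_tilde."""
    return [gamma * a + (1.0 - gamma) * b for a, b in zip(p, p_tilde)]


def concavity_gap(scores, pair, pair_tilde, alpha, gamma):
    """Error of the mixed pair minus the mixed errors; Lemma 1 says this is >= 0."""
    q1 = mix(gamma, pair[0], pair_tilde[0])
    q2 = mix(gamma, pair[1], pair_tilde[1])
    lhs = total_error(scores, q1, q2, alpha)
    rhs = gamma * total_error(scores, pair[0], pair[1], alpha)
    rhs += (1.0 - gamma) * total_error(scores, pair_tilde[0], pair_tilde[1], alpha)
    return lhs - rhs
```

Taking a pair that L separates well and a second pair with the two roles swapped, the mixed pair is harder to classify than the average of the two, so the gap is strictly positive.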

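Note that

$$\mathcal{E}_\alpha\big(p_2/p_1;(p_1,p_2)\big) = \min_{L}\ \mathcal{E}_\alpha\big(L;(p_1,p_2)\big)$$

is concave in $(p_1,p_2)$. Hence, $\max_{(p_1,p_2)} \mathcal{E}_\alpha\big(p_2/p_1;(p_1,p_2)\big)$ could be obtained by methods of convex programming. However, for most problems $\mathcal{E}_\alpha\big(p_2/p_1;(p_1,p_2)\big)$ is extremely time consuming to evaluate and approximations must be used. We note further that the concavity property fails to hold for the Neyman-Pearson error function at level $\beta$, $\mathcal{E}^\beta\big(L;(p_1,p_2)\big) = \mathrm{Prob}_2\{L < t_\beta\}$, where $t_\beta$ is such that $\mathrm{Prob}_1\{L > t_\beta\} = \beta$. To see this, take $\beta = \tfrac12$, $p_1 = p_2 = \tilde p_2 = 1$ on $[0,1]$, $\tilde p_1 = 1$ on $[-2,-1]$, and $L = x$. Then

$$\mathcal{E}^{1/2}\big(L;\ \tfrac12(p_1,p_2) + \tfrac12(\tilde p_1,\tilde p_2)\big) = 0 < \tfrac14 + 0 = \tfrac12\,\mathcal{E}^{1/2}\big(L;(p_1,p_2)\big) + \tfrac12\,\mathcal{E}^{1/2}\big(L;(\tilde p_1,\tilde p_2)\big).$$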

Our main result is now a corollary of the following general minimax theorem, which seems to be new. The proof, however, is quite standard and will be included for completeness.

Theorem 1. Let C be a subset of a topological space and D a convex compact subset of a topological vector space. Let $f(x,y): C \times D \to R$ be a continuous function on $C \times D$ which is concave in y for fixed x. Suppose, further, that there exists a continuous map $x(y): D \to C$ s.t. $f(x(y),y) = m(y) = \min_x f(x,y)$. Then

$$\min_x \max_y f(x,y) = \max_y \min_x f(x,y).$$

Proof. For any f on the product of two sets C, D we have:

$$f(x,y) \le \max_y f(x,y) \ \Longrightarrow\ \min_x f(x,y) \le \min_x \max_y f(x,y) \ \Longrightarrow\ \max_y \min_x f(x,y) \le \min_x \max_y f(x,y).$$

Hence, we need only show

$$\max_y \min_x f(x,y) \ \ge\ \min_x \max_y f(x,y).$$

To this end consider $y^*$ such that $m(y^*) = \max_y m(y)$. The existence of $y^*$ is guaranteed since $f(x(y),y)$ is continuous in y and D is compact. Now for any t ($0<t<1$) and $y \in D$,

$$\begin{aligned}
m(y^*) &\ge m\big((1-t)y^* + ty\big) = f\big(x((1-t)y^*+ty),\ (1-t)y^*+ty\big) \\
&\ge (1-t)\,f\big(x((1-t)y^*+ty),\ y^*\big) + t\,f\big(x((1-t)y^*+ty),\ y\big) \\
&\ge (1-t)\,m(y^*) + t\,f\big(x((1-t)y^*+ty),\ y\big),
\end{aligned}$$

where the second inequality uses the concavity of f in y and the third uses $f(x,y^*) \ge m(y^*)$ for every x. Therefore, for all $0<t<1$ and any y,

$$t\,m(y^*) \ge t\,f\big(x((1-t)y^*+ty),\ y\big), \quad\text{or}\quad m(y^*) \ge f\big(x((1-t)y^*+ty),\ y\big).$$

Letting $t \to 0$ and using the continuity of $f(\cdot,\cdot)$ and $x(\cdot)$ we have $m(y^*) \ge f(x(y^*),y)$ for all $y \in D$, or

$$f(x(y^*),y^*) \ge f(x(y^*),y) \quad \text{for all } y \in D.$$

Since $f(x(y^*),y^*) \le f(x,y^*)$ for all $x \in C$, we have for all $x \in C$, $y \in D$

$$f(x(y^*),y) \le f(x(y^*),y^*) \le f(x,y^*).$$

Therefore

$$\min_x \max_y f(x,y) \le \max_y f(x(y^*),y) \le f(x(y^*),y^*) \le \min_x f(x,y^*) \le \max_y \min_x f(x,y),$$

which was to be shown. The point $(x(y^*),y^*)$ is called a saddle point of $f(x,y)$.
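The minimax equality can be observed on a small example. The sketch below assumes the hypothetical function $f(x,y) = (x-y)^2 - 2y^2$ on $C = D = [-1,1]$, which is continuous, concave in y for fixed x, and has the continuous minimizer $x(y) = y$; both sides are evaluated by brute force on a grid that contains the saddle point $(0,0)$.

```python
def f(x, y):
    """Hypothetical test function: concave in y, minimized over x at x = y."""
    return (x - y) ** 2 - 2.0 * y ** 2


def grid(lo, hi, n):
    """n equally spaced points from lo to hi inclusive."""
    return [lo + (hi - lo) * k / (n - 1) for k in range(n)]


def minimax(func, xs, ys):
    """min over x of max over y, by exhaustive search on the grid."""
    return min(max(func(x, y) for y in ys) for x in xs)


def maximin(func, xs, ys):
    """max over y of min over x, by exhaustive search on the grid."""
    return max(min(func(x, y) for x in xs) for y in ys)


xs = grid(-1.0, 1.0, 201)
ys = grid(-1.0, 1.0, 201)
upper = minimax(f, xs, ys)
lower = maximin(f, xs, ys)
```

Here $m(y) = -2y^2$ is maximized at $y^* = 0$ and $\max_y f(x,y) = 2x^2$ is minimized at $x = 0$, so both quantities equal $f(0,0) = 0$.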


Corollary 1. Suppose $A = \{(p_1^{\bar a}(x), p_2^{\bar a}(x));\ \bar a \in \bar A\}$ is a convex set of pairs of continuous probability densities on $R^d$ such that

$p_i^{\bar a}(x) > 0$ for each $i$, $\bar a$, and $x$;

$\bar A$ is a compact subset of $R^p$ for some p;

$\bar A \to (p_1^{\bar a}, p_2^{\bar a})$ is continuous in the norm

$$\|(f_1,f_2)\| = \sup_{i=1,2;\ x \in R^d} |f_i(x)|.$$

Further, let $\mathcal{L}$ be a class of continuous discriminant functions (with the topology of uniform convergence on compact subsets of $R^d$) with $p_2^{\bar a}/p_1^{\bar a} \in \mathcal{L}$ (or some other optimal continuous function of $(p_1^{\bar a}, p_2^{\bar a})$) for each $\bar a \in \bar A$. Then

$$\min_{L}\ \max_{\bar a}\ \mathcal{E}_\alpha\big(L;(p_1^{\bar a},p_2^{\bar a})\big) = \max_{\bar a}\ \min_{L}\ \mathcal{E}_\alpha\big(L;(p_1^{\bar a},p_2^{\bar a})\big).$$


Proof. One need only check the continuity of $\mathcal{E}_\alpha\big(L;(p_1^{\bar a},p_2^{\bar a})\big)$ and of the map $\bar A \to (p_1^{\bar a}, p_2^{\bar a}) \to p_2^{\bar a}/p_1^{\bar a}$. To verify the former note that $\mathcal{E}_\alpha$ depends (approximately) only on the values of L on a compact set. For the latter note that $\bar a \to p_2^{\bar a}/p_1^{\bar a}$ is uniformly continuous in the sup norm restricted to a compact subset of $R^d$.
Corollary 2. Assume the hypotheses of Corollary 1 with $\mathcal{E}_\alpha$ replaced by $\mathcal{E}^\beta\big(L;(p_1^{\bar a},p_2^{\bar a})\big) = \mathrm{Prob}_2\{L < t_L^\beta\}$, where $t_L^\beta$ is such that $\mathrm{Prob}_1\{L > t_L^\beta\} = \beta$. Further, assume that $p_1^{\bar a} \equiv p_1$ for all $\bar a \in \bar A$. (This is the case in Example 1.) Then

$$\min_{L}\ \max_{\bar a}\ \mathcal{E}^\beta\big(L;(p_1^{\bar a},p_2^{\bar a})\big) = \max_{\bar a}\ \min_{L}\ \mathcal{E}^\beta\big(L;(p_1^{\bar a},p_2^{\bar a})\big).$$

Proof. Since $p_1^{\bar a} = p_1$, $t_L^\beta$ depends only on L. Hence, $\mathcal{E}^\beta$ is concave in its second argument and the result follows as in the case of $\mathcal{E}_\alpha$.

We conjecture that the minimax result holds in general for $\mathcal{E}^\beta$.


III. K'th Order Solutions

Suppose for a fixed $L \in \mathcal{L}$ we wish to find $\bar a^*$ s.t. $\mathcal{E}_\alpha\big(L;(p_1^{\bar a},p_2^{\bar a})\big)$ is maximized. If we assume that the density functions of the $L(p_i^{\bar a})$ are completely characterized (differentiably) by their means, variances, ..., K'th moments about the mean, and that $\mathcal{E}_\alpha$ is differentiable as a function of the difference of the means of L squared, the variance of L under hypothesis 1, the variance of L under hypothesis 2, ..., the K'th central moment of L under hypothesis 2; then letting $v_i^j(\bar a)$ = j'th central moment of $L(p_i^{\bar a})$ (with $v_i^1(\bar a)$ the mean) and differentiating $\mathcal{E}_\alpha\big(L;(p_1^{\bar a},p_2^{\bar a})\big)$ we obtain that a maximal $\bar a^*$ satisfies

$$-\beta_1 \nabla\big[(v_2^1(\bar a) - v_1^1(\bar a))^2\big] + \sum_{i=1}^{2}\sum_{j=2}^{K} \beta_i^j\,\nabla v_i^j(\bar a) = 0$$

where the $\beta_i^j$'s are (usually unknown) scalars corresponding to the partial derivatives of $\mathcal{E}_\alpha$ with respect to the various moments. Hence, a maximal $\bar a$ is a critical point of the objective function

$$(*)\qquad -\beta_1\big(v_2^1(\bar a) - v_1^1(\bar a)\big)^2 + \sum_{i=1}^{2}\sum_{j=2}^{K} \beta_i^j\, v_i^j(\bar a).$$

If $\mathcal{E}_\alpha$ is convex in the above arguments at the critical point in question, then this critical point is a local maximum of (*). Solving for critical points of (*) for various values of the $\beta_i^j$ allows us to reduce our parameters from p (the dimension of $\bar A$) to 2(K-1). The proofs of the preceding assertions are completely parallel to those in [3] and will hence not be formally presented. We note that the above remains valid if L were allowed to vary with $\bar a$ (provided the $L(\bar a)(p_i^{\bar a})$ are characterized by their first K central moments).

As a first example, let $L(\bar a) = \ln\big(p_2^{\bar a}/p_1^{\bar a}\big)$ and consider the first order solution, which is given by the critical points of

$$\Big[E_{p_2^{\bar a}}\big(\ln p_2^{\bar a}/p_1^{\bar a}\big) - E_{p_1^{\bar a}}\big(\ln p_2^{\bar a}/p_1^{\bar a}\big)\Big]^2 = \big[D(\bar a)\big]^2$$

where $D(\bar a)$ is commonly known as the divergence. Since the divergence is always non-negative, the above critical points are the critical points of $D(\bar a)$. If $A = \{(p_1^{\bar a},p_2^{\bar a})\}$ is convex then it can be shown (see [5]) that $D(\bar a)$ is convex in $(p_1,p_2)$ and hence that the critical points are global minima of $D(\bar a)$. We call such a first order solution a minimal divergence solution.
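A minimal divergence solution can be illustrated numerically in one dimension. The sketch below assumes hypothetical unit-variance normal densities and evaluates $D = E_2[\ln(p_2/p_1)] - E_1[\ln(p_2/p_1)]$ by the trapezoid rule, then searches a coarse grid of mixture weights; none of the names or parameter values come from the paper.

```python
import math


def normal_pdf(x, mean, std=1.0):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def divergence(p1, p2, lo=-12.0, hi=12.0, n=2401):
    """Trapezoid approximation of the integral of (p2 - p1) * ln(p2 / p1)."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        x = lo + k * h
        a, b = p1(x), p2(x)
        edge = 0.5 if k in (0, n - 1) else 1.0
        total += edge * (b - a) * math.log(b / a)
    return total * h


def minimal_divergence_weight(p, q1, q2, steps=10):
    """Grid search for the weight a minimizing D(p, a * q1 + (1 - a) * q2)."""
    best_a, best_d = 0.0, float("inf")
    for k in range(steps + 1):
        a = k / steps
        d = divergence(p, lambda x, a=a: a * q1(x) + (1.0 - a) * q2(x))
        if d < best_d:
            best_a, best_d = a, d
    return best_a, best_d


def p(x):
    return normal_pdf(x, 0.0)   # hypothetical density of X


def q1(x):
    return normal_pdf(x, -1.0)  # hypothetical density of Y_1


def q2(x):
    return normal_pdf(x, 1.0)   # hypothetical density of Y_2
```

For two unit-variance normal densities whose means differ by 2 the divergence equals 4, which gives a quick check of the integration; in the symmetric mixture example the convexity of the divergence places the minimum at equal weights.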

As a second example, let $L(\bar a) = L$ and assume $\beta_1 \ne 0$ for the second order solution (*). Then by dividing (*) by $\beta_1$ we see that a second order solution is a critical point of

$$(**)\qquad \alpha\, v_1^2(\bar a) + \beta\, v_2^2(\bar a) - \big(v_2^1(\bar a) - v_1^1(\bar a)\big)^2$$

where $\alpha$ and $\beta$ are scalars.

Theorem 2. Suppose A is convex and $\alpha > 0$, $\beta > 0$. Then the critical points of (**) are global maxima of (**).

Proof. We may rewrite (**) as

$$\alpha\,E_{p_1^{\bar a}}(L^2) + \beta\,E_{p_2^{\bar a}}(L^2) - \alpha\big(E_{p_1^{\bar a}}(L)\big)^2 - \beta\big(E_{p_2^{\bar a}}(L)\big)^2 - \big(E_{p_2^{\bar a}}(L) - E_{p_1^{\bar a}}(L)\big)^2.$$

This is clearly concave as a function of $(p_1^{\bar a},p_2^{\bar a})$ and hence any critical point is a global maximum.
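The concavity used in the proof of Theorem 2 can be checked on discrete distributions. This is a minimal sketch, assuming L takes finitely many values and each hypothesis is a probability vector over them; the names `a` and `b` stand for the positive scalars in (**), and all example values in the usage note are hypothetical.

```python
def mean(values, probs):
    return sum(v * w for v, w in zip(values, probs))


def variance(values, probs):
    m = mean(values, probs)
    return sum(w * (v - m) ** 2 for v, w in zip(values, probs))


def objective(values, p1, p2, a, b):
    """(**): a * Var_1(L) + b * Var_2(L) - (mean_2(L) - mean_1(L)) ** 2."""
    gap = mean(values, p2) - mean(values, p1)
    return a * variance(values, p1) + b * variance(values, p2) - gap ** 2


def mix(gamma, p, p_tilde):
    return [gamma * x + (1.0 - gamma) * y for x, y in zip(p, p_tilde)]


def concavity_gap(values, pair, pair_tilde, a, b, gamma):
    """Objective at the mixed pair minus the mixed objectives (>= 0 if concave)."""
    q1 = mix(gamma, pair[0], pair_tilde[0])
    q2 = mix(gamma, pair[1], pair_tilde[1])
    lhs = objective(values, q1, q2, a, b)
    rhs = gamma * objective(values, pair[0], pair[1], a, b)
    rhs += (1.0 - gamma) * objective(values, pair_tilde[0], pair_tilde[1], a, b)
    return lhs - rhs
```

The variance terms are concave because mixing two distributions adds the spread of their means to the average variance, and the squared mean difference enters with a negative sign, so the gap is non-negative whenever `a` and `b` are positive.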

Corollary 3. Let A be convex and L normal (under each hypothesis) in a neighborhood of a critical point $\bar a^*$ of (**) corresponding to a local maximum of $\mathcal{E}_\alpha$. If the error probabilities of each type (at $\bar a^*$) are less than .5, then $\bar a^*$ is a global maximum of (**) and $\alpha > 0$, $\beta > 0$. We call such a solution a maximal variance solution (since we are maximizing a quadratic form in the first two moments with positive coefficients of the variances). Therefore the second order solution for this case is obtained by calculating the error from normal tables for the $\bar a$ which maximizes (**) for a particular choice of $\alpha$, $\beta$ and then maximizing over $\alpha > 0$, $\beta > 0$.

Proof. Since L is normal near $\bar a^*$ and the error probabilities of each type are less than .5, the optimal threshold (the minimizing t in the definition of $\mathcal{E}_\alpha$) is between $v_1^1(\bar a^*)$ and $v_2^1(\bar a^*)$. Clearly

$$\frac{\partial \mathcal{E}_\alpha}{\partial (v_2^1 - v_1^1)^2} < 0 \ \text{ at } \bar a^*$$

and, by the formulae in [3],

$$\frac{\partial \mathcal{E}_\alpha}{\partial v_i^2} > 0 \ \text{ at } \bar a^*.$$

Hence, $\beta_1 > 0$, $\beta_1^2 > 0$, and $\beta_2^2 > 0$, and a critical point of (*) is a critical point of

$$-\big(v_2^1(\bar a) - v_1^1(\bar a)\big)^2 + (\beta_1^2/\beta_1)\, v_1^2(\bar a) + (\beta_2^2/\beta_1)\, v_2^2(\bar a)$$

which is then a global maximum of

$$\alpha\, v_1^2(\bar a) + \beta\, v_2^2(\bar a) - \big(v_2^1(\bar a) - v_1^1(\bar a)\big)^2$$

for $\alpha = \beta_1^2/\beta_1 > 0$ and $\beta = \beta_2^2/\beta_1 > 0$.
IV. Application - The Mixture Problem - Stationary Gaussian Case

We now discuss the application of the various techniques of II and III to the problem of Example 1. This problem is of interest for several reasons. For the general dual problem with convex A, maximizing error is equivalent to finding the optimal convex combination of the extreme points of A, and many of the methods of solving Example 1 extend to this more general "mixture" problem. Moreover, Example 1 occurs often in practical engineering problems:

(i) In a certain communication channel, through which a random signal S may be transmitted, there are several noise signals that can occur. The probabilities of occurrence of each type of signal are small enough that we may assume that no two occur simultaneously. Unfortunately, the relative probabilities of the noise signals are unknown. Solution of this mixture problem yields (by the minimax principle) a detector for S which is optimal in the worst case and performs at least as well in every other case.

(ii) In order to penetrate an enemy radar defense system effectively, the military deploys a variety of decoys as well as a tactical warhead. Assuming the enemy is aware of the statistical radar signatures of the various objects, the military must assign an optimal relative probability to each type of decoy. This is, again, the mixture problem.
A. First Order Solution

Let $p, q_1, q_2, \dots, q_n$ be the stationary, normal densities in $R^d$ of Example 1. We want to determine weights $a_1, a_2, \dots, a_n$ ($a_i \ge 0$; $\sum_{i=1}^{n} a_i = 1$) such that the divergence is minimum. Since the divergence is convex as a function of a pair of probability distributions, it could be minimized by a gradient descent method, provided we are willing to perform many multivariate integrations. If, however, there are mixtures whose density is close to the (normal) density p, we may give approximate expressions for $D\big(p, \sum_{i=1}^{n} a_i q_i\big)$ which are quadratic in the $a_i$ and hence easily minimized.

More specifically, let $p_1$ be a stationary normal density with (positive definite) correlation matrix $K = ((k_{ij}))$ and mean $(0,0,\dots,0)$, let $p_2$ be a stationary normal density with (positive definite) correlation matrix $K + \Delta$ ($\Delta = ((\Delta_{ij}))$) and mean $\bar m = (m,m,\dots,m)$, and let $\Delta_{ij}$ and m be $O(\varepsilon)$. Then

$$D(p_1,p_2) = \gamma m^2 + \sum_{p}\sum_{s} q_{ps}\,\Delta_{1p}\,\Delta_{1s} + O(\varepsilon^3)$$

where

$$\gamma = \sum_i \sum_j b_{ij} > 0,$$

$$q_{ps} = \sum_{j=1}^{d-s+1}\ \sum_{i=1}^{d-p+1} \big(b_{j,\,i+p-1}\, b_{i,\,j+s-1} + b_{j+s-1,\,i+p-1}\, b_{ij}\big),$$

$$B = ((b_{ij})) = K^{-1},$$

and $((q_{ps}))$ is positive definite.
This is derived in the appendix. Applying the above to our mixture problem with p having mean $(0,0,\dots,0)$ and correlation matrix K, and $q_i$ having mean $(m_i,m_i,\dots,m_i)$ and correlation matrix $Q^i$, we need minimize

$$\gamma\Big(\sum_{i=1}^{n} a_i m_i\Big)^2 + \sum_{p}\sum_{s} q_{ps}\Big[\sum_{i=1}^{n} a_i\big(Q^i_{1p} - k_{1p}\big)\Big]\Big[\sum_{i=1}^{n} a_i\big(Q^i_{1s} - k_{1s}\big)\Big]$$

subject to $\sum_{i=1}^{n} a_i = 1$; $a_i \ge 0$. This is a standard quadratic programming problem.
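A quadratic program of this shape can be sketched with projected gradient descent on the probability simplex. This is one simple possibility, not the method used in the paper; the step-size rule, the iteration count, every function name, and the numerical inputs in the usage note are hypothetical. The array `deltas[i][p]` plays the role of $Q^i_{1p} - k_{1p}$ and `q` that of $((q_{ps}))$, assumed symmetric.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {a : a_i >= 0, sum a_i = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        candidate = (cumulative - 1.0) / j
        if uj - candidate > 0.0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def build_matrix(gamma, means, deltas, q):
    """H such that a^T H a equals the quadratic objective in the weights a."""
    n, d = len(means), len(q)
    h = [[gamma * means[i] * means[j] for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            h[i][j] += sum(q[p][s] * deltas[i][p] * deltas[j][s]
                           for p in range(d) for s in range(d))
    return h


def quad(h, a):
    n = len(a)
    return sum(a[i] * h[i][j] * a[j] for i in range(n) for j in range(n))


def minimize_on_simplex(h, iterations=5000):
    """Projected gradient descent; assumes h is symmetric PSD and nonzero."""
    n = len(h)
    # The trace bounds the largest eigenvalue, so this step is conservative.
    step = 1.0 / (2.0 * sum(h[i][i] for i in range(n)))
    a = [1.0 / n] * n
    for _ in range(iterations):
        grad = [2.0 * sum(h[i][j] * a[j] for j in range(n)) for i in range(n)]
        a = project_to_simplex([a[i] - step * grad[i] for i in range(n)])
    return a
```

With two components of opposite means and no correlation differences, the mean term alone drives the solution to equal weights. For larger problems a dedicated quadratic programming routine would normally be preferable to this simple iteration.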

B. A Second Order Solution

If d is moderately large, then stationarity implies that optimal (quadratic) discriminants will be approximately of the form

$$\begin{aligned}
L ={}& A_0(x_1 + x_2 + \dots + x_d) + A_1(x_1^2 + x_2^2 + \dots + x_d^2) \\
&+ A_2(x_1x_2 + x_2x_3 + \dots + x_{d-1}x_d) + \dots + A_{d-1}(x_1x_{d-1} + x_2x_d) + A_d(x_1x_d).
\end{aligned}$$

In many cases, these discriminants will be approximately normal. Hence, we may use the second order solution for fixed values of $A_0, A_1, \dots, A_d$ and then minimize the associated error over the choice of $A_0, A_1, \dots, A_d$. The above second order solution is equivalent to finding critical points of the one-parameter family of objective functions
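[The display defining the family of objective functions, in terms of coefficients $C^i$ and $D^i$, is not legible in the source.]

with

$$J_{abc} = \begin{cases}
A_{b-a+1} + A_{c-a+1} + A_{c-b+1} & \text{if } a<b<c \\
A_1 + A_{c-a+1} & \text{if } a=b<c \text{ or } a<b=c \\
\text{[illegible]} & \text{if } a=b=c
\end{cases}$$

and

$$F_{abcg} = \begin{cases}
A_{b-a+1}\cdot A_{g-c+1} + A_{c-a+1}\cdot A_{g-b+1} + A_{g-a+1}\cdot A_{c-b+1} & \text{if } a<b<c<g \\
A_1\cdot A_{g-c+1} + A_{g-a+1}\cdot A_{c-a+1} & \text{if } a=b<c<g \\
A_{b-a+1}\cdot A_1 + A_{g-a+1}\cdot A_{g-b+1} & \text{if } a<b<c=g \\
A_1\cdot A_{g-a+1} + A_{b-a+1}\cdot A_{g-c+1} & \text{if } a<b=c<g \\
A_1\cdot A_{g-a+1} & \text{if } a<b=c=g \text{ or } a=b=c<g \\
A_1^2 + A_{g-a+1}^2 & \text{if } a=b<c=g
\end{cases}$$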

Now for $\beta > 0$ a critical point of the associated objective function is a global maximum and can hence be determined by standard quadratic programming methods. We note that it might be useful to develop expressions for $C^i$ involving d terms and $D^i$ involving $d^2$ terms, rendering the computations more feasible for moderately large d.
V. Minimax Feature Extraction

We now describe a purely theoretical application of the minimax principle to feature selection. It is hoped, however, that further research will lead to practical implementation.

Suppose, in the minimum total error classification problem, the (unknown!) densities involved, $(q_1,q_2)$, lie in A, where A is some nontrivial convex set of pairs of possible densities which can be parameterized as in Corollary 1. We may assume that, for a given discriminant function (feature) L, the densities of $L(q_1)$ and $L(q_2)$ can be well enough approximated from sample data that we may consider them known, but that the actual higher dimensional densities remain unknown. Then one approach to solving the classification problem is to construct a sequence of discriminant functions $L_0, L_1, \dots$ whose classification errors decrease. We propose the following sequence: let $L_0$ be some arbitrary initial discriminant. Let $A_0$ be the set of all pairs of densities $(p_1,p_2) \in A$ such that the density of $L_0(p_i)$ equals that of $L_0(q_i)$. Solve the associated dual problem for $A_0$, obtaining as a solution $(\hat p_1,\hat p_2)$. Then let $L_1 = \log(\hat p_2/\hat p_1)$. $L_2$ is then obtained from $L_1$ in an identical fashion, etc. We claim that the classification errors for this sequence are nonincreasing. To establish this claim, it suffices to prove:

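Theorem 3. $\mathcal{E}_\alpha\big(L_1;(q_1,q_2)\big) \le \mathcal{E}_\alpha\big(L_0;(q_1,q_2)\big)$.

Proof. If the densities of $L_0(p_i)$ and $L_0(\tilde p_i)$ are identical to that of $L_0(q_i)$, then this density equals the density of $L_0\big(\gamma p_i + (1-\gamma)\tilde p_i\big)$ for any $0<\gamma<1$. Hence, $A_0$ is convex and the minimax principle together with Bayes' Theorem imply

$$\mathcal{E}_\alpha\big(L_1;(q_1,q_2)\big) \le \mathcal{E}_\alpha\big(L_1;(\hat p_1,\hat p_2)\big) = \mathcal{E}_\alpha\big(\log(\hat p_2/\hat p_1);(\hat p_1,\hat p_2)\big) \le \mathcal{E}_\alpha\big(L_0;(\hat p_1,\hat p_2)\big) = \mathcal{E}_\alpha\big(L_0;(q_1,q_2)\big),$$

where the last equality holds because the density of $L_0$ under $\hat p_i$ equals its density under $q_i$.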


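APPENDIX

Let $L(X) = \log(p_2/p_1)$. Then

$$L(X) = -\tfrac12 (X-\bar m)^t (K+\Delta-m^2)^{-1}(X-\bar m) + \tfrac12 X^t K^{-1} X + \text{terms not varying in } X,$$

where $m^2$ stands for the matrix $\bar m\,\bar m^t$, so that $K+\Delta-m^2$ is the covariance matrix of $p_2$. Since $\Delta = O(\varepsilon)$ and $m^2 = O(\varepsilon^2)$,

$$(K+\Delta-m^2)^{-1} = K^{-1}\big(I - \Delta K^{-1} + m^2 K^{-1} + (\Delta K^{-1})^2 + O(\varepsilon^3)\big).$$

Hence,

$$\begin{aligned}
L(X) ={}& \tfrac12 X^t K^{-1}\Delta K^{-1} X + \bar m^t K^{-1} X - \tfrac12 X^t K^{-1}(\Delta K^{-1})^2 X \\
&- X^t K^{-1}\Delta K^{-1}\bar m - \tfrac12 X^t K^{-1} m^2 K^{-1} X + O(\varepsilon^3) + \text{terms not varying in } X.
\end{aligned}$$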
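The second order expansion of the inverse can be checked numerically. The sketch below assumes hypothetical 2 x 2 matrices (a fixed K, a perturbation proportional to `eps`, and a mean equal to `eps`) and measures the largest entrywise gap between the exact inverse and the truncated series; if the remainder is of third order, shrinking `eps` by a factor of 10 should shrink the gap by roughly 1000.

```python
def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def combine(terms):
    """Sum of (coefficient, matrix) pairs for 2 x 2 matrices."""
    return [[sum(c * m[i][j] for c, m in terms) for j in range(2)]
            for i in range(2)]


def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def expansion_error(eps):
    """Max entrywise gap between (K + Delta - M)^-1 and its truncated series."""
    k = [[1.0, 0.5], [0.5, 1.0]]                             # hypothetical K
    delta = [[0.3 * eps, 0.2 * eps], [0.2 * eps, 0.3 * eps]]  # O(eps)
    m2 = [[eps * eps, eps * eps], [eps * eps, eps * eps]]     # m m^t with m = eps
    identity = [[1.0, 0.0], [0.0, 1.0]]
    k_inv = inverse_2x2(k)
    exact = inverse_2x2(combine([(1.0, k), (1.0, delta), (-1.0, m2)]))
    dk = matmul(delta, k_inv)
    mk = matmul(m2, k_inv)
    series = combine([(1.0, identity), (-1.0, dk), (1.0, mk),
                      (1.0, matmul(dk, dk))])
    approx = matmul(k_inv, series)
    return max(abs(exact[i][j] - approx[i][j])
               for i in range(2) for j in range(2))
```

Comparing `expansion_error(0.1)` with `expansion_error(0.01)` gives a direct view of the cubic decay of the remainder.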

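Using the notation $\langle E_2-E_1\rangle Z = E_2 Z - E_1 Z$, we have

$$D(p_1,p_2) = \langle E_2-E_1\rangle L = \langle E_2-E_1\rangle (L - \text{terms not varying in } X).$$

Since $\langle E_2-E_1\rangle (x_r x_j) = \Delta_{rj}$,

$$\langle E_2-E_1\rangle\big(\tfrac12 X^t K^{-1}\Delta K^{-1} X\big) = \tfrac12 \sum_j \sum_r \sum_i \sum_l b_{ri}\,\Delta_{il}\, b_{lj}\,\langle E_2-E_1\rangle (x_r x_j) = \tfrac12 \sum_j \sum_r \sum_i \sum_l b_{ri}\,\Delta_{il}\, b_{lj}\,\Delta_{rj} = \sum_p \sum_s q_{ps}\,\Delta_{1p}\,\Delta_{1s},$$

where the last step collects terms by means of the stationarity relation $\Delta_{il} = \Delta_{1,\,|i-l|+1}$. Next, $\langle E_2-E_1\rangle(\bar m^t K^{-1} X) = \bar m^t K^{-1}\bar m = \gamma m^2$, and $\gamma > 0$ since $K^{-1}$ is positive definite. Finally $\langle E_2-E_1\rangle (X^t K^{-1}\Delta K^{-1}\bar m) = O(\varepsilon^3)$, $\langle E_2-E_1\rangle\big(\tfrac12 X^t K^{-1}(\Delta K^{-1})^2 X\big) = O(\varepsilon^3)$, and $\langle E_2-E_1\rangle\big(\tfrac12 X^t K^{-1} m^2 K^{-1} X\big) = O(\varepsilon^3)$.

We have now established that

$$D(p_1,p_2) = \gamma m^2 + \sum_p \sum_s q_{ps}\,\Delta_{1p}\,\Delta_{1s} + O(\varepsilon^3).$$

It remains to show that $((q_{ps}))$ is positive definite. We present the following analytic proof: Let $a_1, a_2, \dots, a_d$ be a non-trivial real sequence. Consider $\Delta$ defined by $\Delta_{1p} = \varepsilon a_p$. Since K is positive definite, $K+\Delta$ is positive definite for sufficiently small $\varepsilon > 0$. Setting $\bar m = 0$, we compute $D(p_1,p_2) > 0$. In fact, by passing to a linear space which simultaneously diagonalizes K and $\Delta$, we see that $D(p_1,p_2) = t\,\varepsilon^2 + O(\varepsilon^3)$ for some $t > 0$ and $\varepsilon$ sufficiently small. It now follows that

$$\sum_p \sum_s q_{ps}\, a_p\, a_s > 0.$$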